We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between the image reconstruction fidelity and image editing quality remains an open challenge. The low-rate latent spaces are limited in their expressiveness power for high-fidelity reconstruction. On the other hand, high-rate latent spaces result in degradation in editing quality. In this work, to achieve high-fidelity inversion, we learn residual features in higher latent codes that lower latent codes were not able to encode. This enables preserving image details in reconstruction. To achieve high-quality editing, we learn how to transform the residual features for adapting to manipulations in latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements. Code: https://github.com/hamzapehlivan/StyleRes
translated by 谷歌翻译
本文对实例分割模型进行了全面评估,这些模型与现实世界图像损坏以及室外图像集合,例如与培训数据集不同的设置捕获的图像。室外图像评估显示了模型的概括能力,现实世界应用的一个基本方面以及广泛研究的域适应性主题。当设计用于现实世界应用程序的实例分割模型并选择现成的预期模型以直接用于手头的任务时,这些提出的鲁棒性和泛化评估很重要。具体而言,这项基准研究包括最先进的网络架构,网络骨架,标准化层,从头开始训练的模型,从头开始与预处理的网络以及多任务培训对稳健性和概括的影响。通过这项研究,我们获得了一些见解。例如,我们发现组归一化增强了跨损坏的网络的鲁棒性,其中图像内容保持不变,但损坏却添加在顶部。另一方面,分批归一化改善了图像特征统计信息在不同数据集上的概括。我们还发现,单阶段探测器比其训练大小不太概括到更大的图像分辨率。另一方面,多阶段探测器可以轻松地用于不同尺寸的图像上。我们希望我们的全面研究能够激发更强大和可靠的实例细分模型的发展。
translated by 谷歌翻译
Conventional methods for human motion synthesis are either deterministic or struggle with the trade-off between motion diversity and motion quality. In response to these limitations, we introduce MoFusion, i.e., a new denoising-diffusion-based framework for high-quality conditional human motion synthesis that can generate long, temporally plausible, and semantically accurate motions based on a range of conditioning contexts (such as music and text). We also present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework through our scheduled weighting strategy. The learned latent space can be used for several interactive motion editing applications -- like inbetweening, seed conditioning, and text-based editing -- thus, providing crucial abilities for virtual character animation and robotics. Through comprehensive quantitative evaluations and a perceptual user study, we demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature. We urge the reader to watch our supplementary video and visit https://vcai.mpi-inf.mpg.de/projects/MoFusion.
translated by 谷歌翻译
Dynamic neural networks (DyNNs) have become viable techniques to enable intelligence on resource-constrained edge devices while maintaining computational efficiency. In many cases, the implementation of DyNNs can be sub-optimal due to its underlying backbone architecture being developed at the design stage independent of both: (i) the dynamic computing features, e.g. early exiting, and (ii) the resource efficiency features of the underlying hardware, e.g., dynamic voltage and frequency scaling (DVFS). Addressing this, we present HADAS, a novel Hardware-Aware Dynamic Neural Architecture Search framework that realizes DyNN architectures whose backbone, early exiting features, and DVFS settings have been jointly optimized to maximize performance and resource efficiency. Our experiments using the CIFAR-100 dataset and a diverse set of edge computing platforms have seen HADAS dynamic models achieve up to 57% energy efficiency gains compared to the conventional dynamic ones while maintaining the desired level of accuracy scores. Our code is available at https://github.com/HalimaBouzidi/HADAS
translated by 谷歌翻译
One of the main problems in applying deep learning techniques to recognize activities of daily living (ADLs) based on inertial sensors is the lack of appropriately large labelled datasets to train deep learning-based models. A large amount of data would be available due to the wide spread of mobile devices equipped with inertial sensors that can collect data to recognize human activities. Unfortunately, this data is not labelled. The paper proposes DISC (Deep Inertial Sensory Clustering), a DL-based clustering architecture that automatically labels multi-dimensional inertial signals. In particular, the architecture combines a recurrent AutoEncoder and a clustering criterion to predict unlabelled human activities-related signals. The proposed architecture is evaluated on three publicly available HAR datasets and compared with four well-known end-to-end deep clustering approaches. The experiments demonstrate the effectiveness of DISC on both clustering accuracy and normalized mutual information metrics.
translated by 谷歌翻译
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
translated by 谷歌翻译
Self-attention is of vital importance in semantic segmentation as it enables modeling of long-range context, which translates into improved performance. We argue that it is equally important to model short-range context, especially to tackle cases where not only the regions of interest are small and ambiguous, but also when there exists an imbalance between the semantic classes. To this end, we propose Masked Supervised Learning (MaskSup), an effective single-stage learning paradigm that models both short- and long-range context, capturing the contextual relationships between pixels via random masking. Experimental results demonstrate the competitive performance of MaskSup against strong baselines in both binary and multi-class segmentation tasks on three standard benchmark datasets, particularly at handling ambiguous regions and retaining better segmentation of minority classes with no added inference cost. In addition to segmenting target regions even when large portions of the input are masked, MaskSup is also generic and can be easily integrated into a variety of semantic segmentation methods. We also show that the proposed method is computationally efficient, yielding an improved performance by 10\% on the mean intersection-over-union (mIoU) while requiring $3\times$ less learnable parameters.
translated by 谷歌翻译
Previous virtual try-on methods usually focus on aligning a clothing item with a person, limiting their ability to exploit the complex pose, shape and skin color of the person, as well as the overall structure of the clothing, which is vital to photo-realistic virtual try-on. To address this potential weakness, we propose a fill in fabrics (FIFA) model, a self-supervised conditional generative adversarial network based framework comprised of a Fabricator and a unified virtual try-on pipeline with a Segmenter, Warper and Fuser. The Fabricator aims to reconstruct the clothing image when provided with a masked clothing as input, and learns the overall structure of the clothing by filling in fabrics. A virtual try-on pipeline is then trained by transferring the learned representations from the Fabricator to Warper in an effort to warp and refine the target clothing. We also propose to use a multi-scale structural constraint to enforce global context at multiple scales while warping the target clothing to better fit the pose and shape of the person. Extensive experiments demonstrate that our FIFA model achieves state-of-the-art results on the standard VITON dataset for virtual try-on of clothing items, and is shown to be effective at handling complex poses and retaining the texture and embroidery of the clothing.
translated by 谷歌翻译
家庭中的移动操纵器可以为患有严重运动障碍的人提供越来越多的自治权,他们在没有照料者的帮助下通常无法完成日常生活(ADL)的活动。辅助移动操纵器的远距离运行可以使患有运动障碍的人能够独立执行自我保健和家庭任务,但是有限的运动功能会阻碍人们与机器人接触的能力。在这项工作中,我们介绍了一个独特的基于惯性的可穿戴辅助界面,该辅助界面嵌入了熟悉的头饰服装中,适用于具有严重运动障碍的人,可以通过移动操纵器进行远程处理和执行身体任务。我们评估了这种可穿戴的界面(n = 16)和有运动障碍的个体(n = 2),用于执行ADL和日常家庭任务。我们的结果表明,可穿戴界面使参与者能够完成错误率,高度可感知的易用性和低工作负载度量的身体任务。总体而言,这种基于惯性的可穿戴设备是一种新的辅助接口选项,可控制家庭中移动操纵器。
translated by 谷歌翻译
自主表面容器(ASV)代表了自动化湖泊水质监测的有前途的技术。在这项工作中,我们使用卫星图像作为粗图,并计划机器人的采样路线。但是,卫星图像与实际湖泊之间的不一致以及环境干扰(例如风,水生植被和不断变化的水位)可能使机器人难以参观先前地图建议的地方。本文提出了一种强大的路线规划算法,鉴于这些环境干扰,该算法可最大程度地减少预期的总行驶距离,从而引起地图中的不确定性。我们验证了算法在一千多个加拿大湖泊中的模拟中的功效,并在加拿大安大略省北部的一个湖泊中证明了我们在3.7 km长的现实世界机器人实验中应用算法的应用。
translated by 谷歌翻译